perm filename TANG.TEX[1,DEK]1 blob
sn#574966 filedate 1981-03-27 generic text, type C, neo UTF8
\input basic
\magnify{1300}
\chcode'27=13 \def↔{\penalty999\ }
\font t=cmtt \def\.#1{\hbox{\:t#1}}
\def\chin{\penalty1000\quad}
\ctrline{\bf On the Problem of Inputting Chinese Characters}
\vskip .5truein plus .1truein
\ctrline{Chih-sung Tang}
\ctrline{(Department of Computer Science, Stanford University)}
\ctrline{(Institute of Computing Technology, Academia Sinica)}
\vskip .5truein plus .1truein
\botinsert{\vskip 8pt\hrule width 2truein\vskip 4pt \baselineskip 8pt\:d
\hbox par size{The publication of this report was supported in part by NSF grant
IST-7921977.}}
\def\section #1. #2.{\vskip 18pt
\noindent{\bf#1. #2.}\par\penalty1000\vskip 6pt\noindent\!}
\section 1. The Problem.
Many people are currently seeking a good way to do input and output
of Chinese characters on computers.
It seems to me that a lot of confusion about this problem is also prevalent.
Some of the confusion may be due to mismatching the means with the end; and
some may stem from a misunderstanding
of the problem. Basically, people are apt to think of input and output with
Chinese characters as a simple generalization of the input/output problem for
conventional characters, while in fact it is not.
For example, inputting and outputting of Chinese characters
are themselves two separate information processing problems.
I think that a clarification of the problem may be a helpful step towards its
solution.
People often say that the inputting and outputting of Chinese characters is an
important and difficult problem, at least for Chinese-speaking people.
But why is it important?
Why difficult? I think the most convincing answer to the first question
might be that without Chinese characters there are a great many important
applications of non-numeric information processing
in business, management, information
retrieval, etc., that
cannot be well developed in the Chinese-speaking society.
So when
we talk about the inputting and outputting of Chinese characters, what we mean is to
do it in the context of an information processing program, not just to input or
output the characters as an end in itself.
In this sense, Chinese character input/output is quite a different problem from
what we may call a Chinese character writing system.
I once saw a system of the latter
kind in a famous research laboratory. The user can write a Chinese character stroke
by stroke at the terminal, just as we write a Chinese character on paper.
The author also claims that his work deals with the input problem for Chinese
characters! This kind of work may have its significance in the research of graphics
or artificial intelligence, but it is far from becoming a practical input means for
a real application since few programs can be allowed to receive their
input at such a slow speed.
Furthermore, the problem of inputting and outputting Chinese characters in an
information processing program is a different problem from building a Chinese
typewriter;
the former need not presuppose the latter.
People accustomed to considering the input and output of
Chinese characters as a generalization of the input and output of
conventional characters are quick to identify the input/output problem with that
of building a typewriter.
But in fact, it is more difficult to build a typewriter,
because (1) a typewriter must deal with a universal dictionary,
(2) a typewriter cannot take advantage of the
fact that input/output can be treated as a subproblem in a larger
information processing environment, and
(3) a typewriter that can type only Chinese
is not useful in information processing, while one that
can type both Chinese and conventional characters has too big a keyboard.
Let me explain these points in more detail. In an ordinary information processing
problem, the vocabulary is always limited
for a special application. In most cases
we could even force the user to use only a certain restricted special-purpose
dictionary. In other applications such as a library system, it appears
that a universal dictionary may be required, yet
it is still possible to make use of the principle of ``divide and
conquer'' by partitioning a large dictionary into several special-purpose ones,
for example by sorting the books of a library into different subcategories.
In this way, many difficult problems can be ameliorated.
When we treat the input/output problem as a subtask in an information
processing program, it is possible not only to specialize the vocabulary
with respect to the problem at hand,
but also to treat the dictionary as a private resource:
A user can even establish a dictionary
according to his or her own personal taste, and one user might have
several dictionaries, each corresponding to one special inputting method.
Thus the binding time for the choice of an input method can be delayed to the
very time of input.
This is a very important concept in the approach that we will discuss below.
The last, but not the least important, point we shall emphasize is that any Chinese
input/output system must be able to input and
output conventional characters also; otherwise
it would be only a Chinese text copying system and would be of little
relevance to information processing.
No information processing program can be written with only
Chinese characters and no other symbols.
Indeed, I once saw a computer system equipped
with such a system, and
I doubt if the system has ever been used in any information processing program.
\vskip 3pt plus 3pt
Next, inputting and outputting of
Chinese characters must be considered as separate
information processing problems.
The only connection required between them is to find
a one-to-one or many-to-one correspondence between the input code of a word and the
internal representation of its output code.
Some people try to impose an exact
structural image of the output pattern onto its input code, and
this always makes the input code unnecessarily complicated.
The output problem itself should be divided into two distinct parts: (1) to
output recognizable Chinese characters quickly and efficiently, and (2)
to deal with characters that
are aesthetically beautiful. Both of these problems have by now
been resolved satisfactorily, at least to some extent,
but the input problem is still stubborn. So the remainder
of this paper will concentrate on the input problem.
Thus, the problem we are going to discuss is {\it to find an easy, quick, and
convenient way to input Chinese characters together with other conventional
characters,
in the context of an information processing environment, with the result that the
input code of each Chinese character can be used to identify the internal
representation of its output code almost uniquely.}
Is this an easy problem?
If we neglect the ``easy, quick, and convenient'' requirements
above, the problem is not so hard as it seems. For example,
I know that in some places there have
been Chinese telegraph operators who can readily recite all the digital
codes of Chinese characters. To them the inputting problem is obviously trivial.
But to simplify the
input problem at the expense of asking the user to be trained to remember too
many conventions is not an acceptable approach,
for a method that is accessible only
to highly trained experts can never become popular.
Speed is another important consideration.
It is well known that an English word contains about 4 to 5 letters, on the average,
so it is reasonable to expect that each Chinese character be input
with no more than 4 or 5 keystrokes.
On the other hand, speed is only one of the criteria by which a method can be
judged; it would be a mistake to
lay too much emphasis on speed at the expense of other qualities.
``Convenience'' is another important requirement: No method would be acceptable if
it were hard to handle.
Once I saw a system that can input both Chinese and conventional
characters; since the keyboard is limited (too big a board being inconvenient
to operate), each key must be used to represent two or more conventional characters
together with 2 or 3 different Chinese radicals. You can imagine what a mess
this is! I am afraid that it would take a very long time
for most people to find one key, and I suspect there would be many printing
errors on each page. Can this be considered a good design?
Finally, the uniqueness of representation by which each output code is identified
from the input code is another crucial point. In fact, until now no significant
method exists that is
able to establish absolute uniqueness, and some systems suffer from too
many ambiguities. For example, the idea of inputting
Chinese characters according to their Pinyin spelling
often has this defect. (Incidentally, the Pinyin method has still other defects;
for example,
Chinese people speak a great many different dialects and nonstandard varieties of Mandarin.)
A good method must satisfy two requirements: (i) The percentage of exceptional
ambiguities should be at most a certain acceptable limit,
e.g., $1.5\%$; (ii) the system should
afford a good method to treat these exceptional cases.
Consequently, judged by these standards, Chinese character inputting is indeed a
difficult problem.
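Requirement (i) can be checked mechanically once a dictionary exists. The following Python sketch is illustrative only and is not part of the system described in this report. It assumes that the dictionary is available as a list of (input code, character) pairs, and it counts an input code as ambiguous when it maps to more than one character; the names \.{ambiguity_rate} and \.{within_limit} are hypothetical, and the character labels used with it are placeholders.

```python
from collections import defaultdict


def ambiguity_rate(pairs):
    """Return the percentage of distinct input codes that map to
    more than one character.

    `pairs` is an iterable of (input_code, character) tuples; this
    format is an assumption made for the sketch.
    """
    chars_by_code = defaultdict(set)
    for code, char in pairs:
        chars_by_code[code].add(char)
    if not chars_by_code:
        return 0.0
    ambiguous = sum(1 for chars in chars_by_code.values() if len(chars) > 1)
    return 100.0 * ambiguous / len(chars_by_code)


def within_limit(pairs, limit_percent=1.5):
    """Check requirement (i) against an acceptable limit (1.5% by default)."""
    return ambiguity_rate(pairs) <= limit_percent
```

Other definitions of the percentage are possible (for example, counting ambiguous characters weighted by their frequency in running text); the choice made here is only one reasonable reading of requirement (i).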
\section 2. A Dictionary-Independent System.
It seems to me that there is a very simple but useful concept that has
been almost neglected
by most systems I have encountered,
namely to distinguish and separate the input
system (the program that accepts the input code) from the input method (the method of
expressing each Chinese character by its input code, i.e., the method of
striking the keys).
Because this distinction is neglected, many systems are built according to one fixed input method.
It is even
worse to build a mechanical-electronic input device in this way. Such a design
is bad not only because no input method can as yet really be considered
the best and in need of no further modification, but also (perhaps more
importantly) because this approach itself leaves the method without flex\-i\-bil\-ity.
Since the basic problem is to establish a one-to-one or
many-to-one correspondence between the input code and an
internal representation of the
output code of each word, all such information is contained in a dictionary. As
long as the user inputs Chinese and other characters using a method consistent
with whatever dictionary is being used,
the requirement is fulfilled, and the program system is implemented
to search the dictionary.
So it is natural to implement it as dictionary-independent
software. If a system is constructed in this way, any
change of input method requires only a
change of dictionary. The system is always stable; more importantly, such a
system is flexible in the following
practical ways: (1) One system can adapt to many different
input methods. (2) Different applications can have different dictionaries all
residing in one system.
(3) The dictionary can always be extended.
(4) Each word or character can have several different input codes.
(5) Even when one method is considered as the best,
different users or different situations may require some minor alterations, and
such modifications are easily tolerated.
Users who wish to input some particular words in a novel way, according to
their special taste, can do so, even if they choose to use completely different
principles for some characters than for others.
For example,
there may be a user whose dictionary is built according to the method I shall
introduce
in the next section; but when special ambiguous
words like ``\chin'', ``\chin'', ``\chin'', and ``\chin''
occur, it would be perfectly right for the user to input these words with
some other method that seems natural (e.g., Pinyin without the four tones).
In order to manage such a system, we suggest a command of the following form:
$$\hbox{\.{@HANZ(}$\langle$file name$\rangle\.;\langle$extensible part$\rangle\.)$}$$
Here the ``file name'' denotes the dictionary to be used, and
the ``extensible part'' specifies
input/output code pairs for new
words not contained in the dictionary or new input codes for words already in it.
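The separation of input system from input method can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation mentioned in this report: the class name \.{InputSystem}, the pair format of the dictionary, and the placeholder character labels are all hypothetical, and the two constructor arguments merely mimic the ``file name'' and ``extensible part'' of the command above.

```python
class InputSystem:
    """A lookup program that knows nothing about any particular input method.

    All knowledge of the method lives in the dictionary, given here as
    (input_code, character) pairs; `extensions` plays the role of the
    "extensible part" (new words, or new codes for existing words).
    """

    def __init__(self, dictionary, extensions=()):
        self._table = {}
        for code, char in list(dictionary) + list(extensions):
            self.add(code, char)

    def add(self, code, char):
        # Several codes may map to the same character (many-to-one), and a
        # code may occasionally map to several characters (an ambiguity).
        chars = self._table.setdefault(code, [])
        if char not in chars:
            chars.append(char)

    def lookup(self, code):
        """Return the candidate characters for an input code (possibly none)."""
        return list(self._table.get(code, []))
```

Changing the input method then amounts to constructing the same \.{InputSystem} with a different dictionary; the lookup code itself is untouched.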
\section 3. An Input Method.
In this section, a particular input method will be introduced
that seems to me able
to fulfill the demands listed in previous sections. This method consists of
three basic points: (i) It illustrates a new way to define the constituents
that form a
Chinese character. (ii) It also provides a way to arrange the constituents of a
word into linear order. (iii) It does not require an exhaustive listing of all
parts of a Chinese character, when partial information is sufficient.
The method for defining the constituents of Chinese characters is based on an
interesting idea that
Chang Hung-fang↔[1] has introduced to search for Chinese words in a dictionary.
It seems to me that the method is not very successful for his original purpose (to
search in a dictionary), but it will indeed be
very useful for inputting Chinese characters.
His basic idea is to think of the constituents of Chinese characters as if they
were similar in shape to Latin letters.
For example, the constituent ``\chin'' (side ear) looks very
much like a letter ``p'', and ``\chin'' looks a lot like the
letter ``k'', etc. Obviously, these
constituents do not correspond exactly with traditional Chinese radicals,
but they have the following virtues:
(i)↔To code Chinese characters in this way, we can use
the same keyboard to input both conventional symbols and Chinese constituents.
(ii)↔Since both types of symbols have
similar shapes, the correspondence is easy for keyboard operators to remember.
(iii)↔Each of these constituents
can cover a rather large part of a word, so four or fewer constituents will
usually suffice to identify a word.
Adopting Chang's basic idea, we can modify it slightly and generalize it
to other conventional characters that appear on standard keyboards, as in
Table↔1 at the end of this report. (Table↔1 is still subject to modification; the
user can make his own conventions according to the principles discussed
above.) Chang calls the constituents the {\it shape constructs},
or simply ``constructs'', of the character.
To arrange the constituents of a word into linear order, we have tried several
methods (cf.\ [2]).
The best of these seems to be to arrange them by
successively cancelling
the largest construct encountered at the four corners of a word,
proceeding clockwise from the upper right corner.
If the word has been exhausted before
four steps are completed, its code has fewer than four symbols; otherwise
it is encoded by four nonblank symbols. Let us look at some examples:
\def\\#1{{\baselineskip 6pt\vbox{\halign{\hfil\.{##}\hfil\cr#1\cr$\→$\cr}}}}
$$\baselineskip 30pt
\vbox{\halign{\qquad#\hfil⊗\qquad#\hfil⊗\qquad#\hfil⊗\qquad#\hfil⊗\qquad\qquad
code = \.{#}\hfil\cr
\\A⊗\\P⊗\\O⊗⊗APO\cr
\\D⊗\\h⊗\\g⊗⊗Dhg\cr
\\n⊗\\n⊗\\{\char'17}⊗\\T⊗nn\char'17 T\cr}}$$\eject
$$\baselineskip 30pt
\vbox{\halign{\qquad#\hfil⊗\qquad#\hfil⊗\qquad#\hfil⊗\qquad#\hfil⊗\qquad\qquad
code = \.{#}\hfil\cr
\\r⊗\\Z⊗\\*⊗\\-⊗rZ*-\cr
\\l⊗\\!⊗\\{\char'17}⊗\\|⊗l!\char'17 |\cr
\\L⊗\\n⊗\\|⊗\\s⊗Ln|s\cr
\\n⊗\\l⊗\\m⊗\\{\char'43}⊗nlm\char'43\cr
\\D⊗\\{\char'17}⊗\\T⊗\\r⊗D\char'17 Tr\cr}}$$
In each case the first symbol of the code is the construct removed at the upper right, the second at
the lower right, the third at the lower left, and the fourth at the upper left.
The last example does not seem very satisfactory, for it has too big a remainder.
But in fact, its input code is already sufficient to identify the internal
representation. Fewer ambiguities would arise if we were to
choose five steps as the maximum for each word.
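The corner-cancelling rule can be modelled roughly as follows. This Python sketch rests on assumptions that do not come from this report: each construct is described by a keyboard symbol, a relative size, and the set of corners it touches; a corner that no remaining construct touches is simply skipped; and a fifth step, if allowed, returns to the upper right corner. The names \.{Construct} and \.{encode} are hypothetical.

```python
from dataclasses import dataclass

# Clockwise order, starting from the upper right corner.
CORNERS = ("upper_right", "lower_right", "lower_left", "upper_left")


@dataclass(frozen=True)
class Construct:
    key: str            # keyboard symbol standing for the construct
    size: int           # relative area covered (assumed to be known)
    corners: frozenset  # corners of the character that the construct touches


def encode(constructs, max_steps=4):
    """Cancel the largest construct at each corner in turn and
    return the resulting input code."""
    remaining = list(constructs)
    code = []
    for step in range(max_steps):
        if not remaining:
            break  # word exhausted: the code has fewer symbols
        corner = CORNERS[step % len(CORNERS)]
        candidates = [c for c in remaining if corner in c.corners]
        if not candidates:
            continue  # assumption: skip a corner that nothing touches
        largest = max(candidates, key=lambda c: c.size)
        remaining.remove(largest)
        code.append(largest.key)
    return "".join(code)
```

With a hypothetical three-construct character whose constructs \.{A}, \.{P}, \.{O} sit at the upper right, lower right, and lower left, the function yields the three-symbol code \.{APO}; whatever is left after the last step (the ``remainder'') is ignored.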
Note that about 90 conventional symbols are also considered to
be special ``Chinese characters,''
i.e., Chinese characters of one constituent, and
there are a few real Chinese characters that happen to
contain only one constituent; these might be confused with conventional characters.
In order to avoid such confusion, we adopt the convention that
Chinese characters having only one constituent are always preceded by the symbol
``\.1''. For example,
``\.{1R}'' is ``\chin''; ``\.{1K}'' is ``\chin''.
But there are some symbols corresponding to more
than one character.
In these cases, we can use ``\.{12}'', ``\.{13}'', ``\.{14}'' as prefixes,
to distinguish them according to their order of occurrence in Table 1;
e.g., ``\.{1m}'', ``\.{12m}'', ``\.{13m}'', and ``\.{14m}''
represent ``\chin'', ``\chin'', ``\chin'', and ``\chin'', respectively.
Similarly, we can use ``\.2'', ``\.3'',
``\.4'' to express repetition. For $i=2,$ 3, 4, ``$iπ$'' means to repeat the
``$π$'' $i$ times; e.g.,
``\.{2+\char'17}'' is ``\chin''; ``\.{3+\char'17}'' is ``\chin'';
``\.{3o}'' is ``\chin''.
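The two prefix conventions of this section are easy to state as code. The Python sketch below is only an illustration: the function names are hypothetical, the ASCII circumflex stands in for the raised symbol used in the examples, and it is assumed that a leading \.2, \.3, or \.4 always marks repetition of everything that follows it.

```python
def single_constituent_code(symbol, occurrence=1):
    """Code for a Chinese character consisting of one constituent.

    `occurrence` is the position (1 to 4) of the character among those
    sharing `symbol`, in their order of occurrence in Table 1.
    """
    if not 1 <= occurrence <= 4:
        raise ValueError("occurrence must be between 1 and 4")
    prefix = "1" if occurrence == 1 else "1" + str(occurrence)
    return prefix + symbol


def expand_repetition(code):
    """Expand a repetition code: a leading i in {2, 3, 4} followed by a
    construct string means that string repeated i times."""
    if len(code) >= 2 and code[0] in "234":
        return code[1:] * int(code[0])
    return code
```

For example, the third character listed under \.m receives the code \.{13m}, and \.{3o} stands for the constructs \.{ooo}.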
\section 4. Chinese Characters in Programming Languages.
Since the Chinese characters of interest are those occurring in an information
processing program, it is natural to consider the information processing of Chinese
characters as a problem related to programming languages. We have considered this
problem in two ways: (i) To design a new language with abundant facilities for
Chinese character information processing. (ii) To implement Chinese versions of
well-known existing programming languages, e.g., Chinese Cobol, Chinese Pascal, etc.
For problem (i), we have designed a language system with this goal in mind. The
XYZ family of languages contains the following features:
\vskip 3pt
\item{(1)}In order to diminish the number of reserved Chinese words, we never use
natural language words as grouping symbols like ``{\bf begin $\ldots$ end}'';
instead, we always
use mathematical brackets for this purpose. Furthermore,
we always use one reserved
word together with a mathematical
symbol to replace multi-word delimiters; e.g., we use the
structure ``BRANCH$[B{:}\,S↓1;{:}S↓2 ]$'' instead of ``{\bf if $B$ then $S↓1$ else}
$S↓2$'', etc.
\vskip 3pt
\item{(2)}Every reserved word has both a Chinese counterpart and an English counterpart;
sometimes it also has a math symbol as a third choice. The user can choose freely
among
them. Thus, the reserved word for looping
has three forms: ``{\bf Iterate}''
or ``\chin'' or ``\.{*}''.
\vskip 3pt
\item{(3)}The user can define both Chinese and Latin identifiers.
But we expect users not to
use Chinese identifiers as the names of functions, for this is never done in
traditional Chinese mathematics textbooks;
Chinese function names would not be readable.
\vskip 3pt
\item{(4)}The language contains a new type CHINESE; a variable of this type can be
assigned a Chinese string as its initial value.
\vskip 3pt
By means of the input method shown in the previous sections, we have been implementing
a Chinese Cobol. We treat it in the following simple way: Suppose we already have a
compiler for English Cobol in the computer; we implement a preprocessor to accept the
program written in Chinese Cobol. It translates the program into English Cobol,
prints out both texts,
and then passes control to the existing compiler to compile and execute the translation. If
the program contains user-defined identifiers for which the user wants to have a
readable English correspondent, he or she must input a table indicating
the correlation; otherwise
the system will give each such identifier a standard name such as \.{XX}$i$.
So our
input method can help us to solve such problems as creating a
Chinese Cobol, problems that
are often considered difficult to deal with.
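The treatment of user-defined identifiers in the preprocessor can be sketched as follows. This Python fragment is an illustration under assumptions, not the preprocessor itself: the function name \.{translate_identifiers} is hypothetical, the identifiers are taken as an already extracted list, and the standard names are numbered \.{XX1}, \.{XX2}, $\ldots$ in order of first appearance.

```python
def translate_identifiers(identifiers, user_table=None):
    """Map each Chinese identifier to an English name.

    Names supplied in `user_table` are used as given; every other
    identifier receives a standard name XX1, XX2, ... in order of
    first appearance (the numbering scheme is an assumption).
    """
    user_table = user_table or {}
    mapping = {}
    counter = 0
    for ident in identifiers:
        if ident in mapping:
            continue  # already translated; keep the translation consistent
        if ident in user_table:
            mapping[ident] = user_table[ident]
        else:
            counter += 1
            mapping[ident] = f"XX{counter}"
    return mapping
```

A real preprocessor would also have to ensure that a generated name never collides with a name chosen by the user; that check is omitted here.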
\vskip 8pt
\noindent{\bf Acknowledgments.}
An input system has been implemented by Jianxin Wang at the Institute
of Computing Technology, Academia Sinica, Beijing, along the lines suggested
here; the research for this paper was completed at Stanford in 1981.
The author wishes to express his gratitude
to Profs.\ McCarthy and Knuth for their support.
\vskip 8pt
\ctrline{\bf References}
\vskip 3pt
\item{[1]}Chang Hung-fang, ``A method to search for words in a
Chinese dictionary.'' Draft.
\vskip 3pt
\item{[2]}Tang Chih-sung and Wang Jiangxin, ``Chinese character inputting and higher
level languages.'' {\sl Chinese Journal of Computers \bf 1} (1979).
\vfill\eject
\topspace 0pt \vfill\vbox to 13truecm{\vss\ctrline{\bf Table 1}}
\vfill\end
Dear Dr. Tang,
I have made a TEX version of your paper on Chinese character input;
while doing this I took the liberty of editing it. Although your use
of English is quite clear, I tried to make the wording "really
excellent" as would read most smoothly to a native speaker. In the
process, I may have unintentionally distorted your meaning, of course;
and perhaps it would even be better if the English were not quite
so fluent, because we surely want Chinese people to be able to read this!
Anyway, you can make yourself a copy of the paper in its current form
by giving the command ".dover tang.press[1,dek]".
I have several questions that I think should be answered before we issue
this as a Stanford report:
1) More complete information should be given in the references. For reference
number [1], an approximate date, and the author's location, should be given.
For reference number [2], please supply the page numbers.
2) I think you should also include a reference to your XYZ system (including
title, journal, volume, date, pages).
3) In one place you spelled it "Pingying" and in the other places "Pinying".
I think I have usually seen the spelling "Pinyin". What is correct?
4) I think it simplifies the illustrations of your input coding scheme
if the numbers "1" "2" "3" "4" are left off the diagrams.
So I added an explanatory sentence after those diagrams.
5) Please tell me if I haven't left enough space for you to draw in the characters!
6) I don't think the use of 2,3,4 for repetition is such a good idea. For one
thing, I see that 5 is a construct; so there is some inconsistency. Secondly,
why not write "ooo" instead of "3o"? Then the user needn't remember so many
rules. In the other examples, like "3+∧", you seem to be using +∧ as a single
construct, which violates the spirit of your proposal to work on one corner
at a time. Thus I feel the section on repetition "iπ" would be best eliminated.
7) I have only one "major" suggestion, namely that you should be careful to
stick to a standard ASCII keyboard if possible. This has many advantages,
especially that ASCII has been a standard for so long that a great many
manufacturers now make the keyboards and many people know how to use them, in all
parts of the world. Some of the symbols you have in Table 1 are to be found
only on the SAIL keyboards, which are quite rare except at Stanford. And
it also looks like you have a Japanese yen sign? In the examples of the
paper, I have used the ASCII character ∧, which is raised (it is a
circumflex or "hat" accent), in order to make sure that everything could be
done on the common computer terminals. You should not use symbols like logical
not (¬), etc.; the available symbols, besides 26+26+10 letters and digits, are:
! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ∧ _ ` { | } ~
(they are listed, for example, as codes octal41 to octal176 on page 168 of
my TEX manual). I really believe that if you make a few changes to Table 1
so that everything is ASCII, your system will immediately be a lot more
popular. Incidentally, the lower case letter l (ell) does NOT look like a script
letter as it does in Table 1, so that should probably be changed; this affects
some of the examples in section 3.
8) In the first paragraph of the paper, I cannot understand your phrase
"some of the confusion may be due to mismatching the means with the end".
Since the first paragraph is the most important of all, please tell me
in more detail what you mean here!
Please let me know of any changes you would like to make, and/or if you wish
to wait a little before this is printed.